An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication


Abstract

This article proposes a novel hardware accelerator for the inference task with sparse convolutional neural networks (CNNs), built from a unit that performs the Image to Column (Im2Col) transformation of the input feature map coupled with a systolic-array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the Im2Col transformation with the GEMM computation to maximize parallelism. We propose a novel design for the Im2Col unit that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map only once. The systolic array in our design can be dynamically configured as multiple units of square-shaped systolic arrays or as a single tall array. This dynamic reconfigurability enables effective pipelining of the Im2Col and GEMM operations and attains high processing element utilization for a wide range of CNNs. Further, our design is sparsity-aware: it improves performance and energy efficiency by effectively mapping the sparse feature maps and weights to the processing elements, skipping ineffectual computations and unnecessary data movements involving zeros. Our prototype, SPOTS, is on average 2.16×, 1.74×, and 1.63× faster than Gemmini, Eyeriss, and Sparse-PE, respectively, which are prior CNN accelerators. SPOTS is also 78× and 12× more energy-efficient when compared to CPU and GPU implementations, respectively.
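As a software analogue of the pipeline the abstract describes, the sketch below shows how the Im2Col transformation turns a convolution into a GEMM. It is a minimal stride-1, single-channel Python illustration of the general technique, not the SPOTS hardware unit; all names and sizes here are illustrative.

```python
def im2col(fmap, k):
    """Unroll k x k sliding windows (stride 1) of a 2-D feature map
    into columns: result has k*k rows and one column per window."""
    h, w = len(fmap), len(fmap[0])
    oh, ow = h - k + 1, w - k + 1
    patches = []
    for i in range(oh):
        for j in range(ow):
            # Flatten the k x k patch anchored at (i, j).
            patches.append([fmap[i + di][j + dj]
                            for di in range(k) for dj in range(k)])
    # Transpose so each patch becomes a column of a (k*k) x (oh*ow) matrix.
    return [list(row) for row in zip(*patches)]

def gemm(a, b):
    """Plain dense matrix-matrix product: a (m x n) times b (n x p)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Convolving a 3x3 feature map with one 2x2 filter becomes a
# (1 x 4) by (4 x 4) GEMM whose row holds the four output pixels.
fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
filt = [[1, 0],
        [0, 1]]
cols = im2col(fmap, 2)
flat_filter = [[v for row in filt for v in row]]  # filter as a 1 x 4 row
out = gemm(flat_filter, cols)
print(out)  # [[6, 8, 12, 14]]
```

In SPOTS the two stages are overlapped in hardware; in this sketch they run sequentially, which is exactly the serialization the accelerator's pipelining avoids.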



Similar articles

Sparse Matrix Multiplication on CAM Based Accelerator

Sparse matrix multiplication is an important component of linear algebra computations. In this paper, an architecture based on Content Addressable Memory (CAM) and Resistive Content Addressable Memory (ReCAM) is proposed for accelerating sparse matrix by sparse vector and matrix multiplication in CSR format. Using functional simulation, we show that the proposed ReCAM-based accelerator exhibits...
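For reference, the CSR format the snippet mentions stores only the nonzeros of each row plus their column indices. The sketch below gives the functional semantics of CSR construction and a CSR sparse matrix-vector product in plain Python; the accelerator above realizes this with CAM/ReCAM hardware, which this sketch does not model.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # prefix sum of nonzeros per row
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x with A in CSR; only the nonzeros are ever touched."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

A = [[5, 0, 0],
     [0, 0, 3],
     [2, 0, 1]]
vals, cols, ptrs = to_csr(A)
print(vals, cols, ptrs)                       # [5, 3, 2, 1] [0, 2, 0, 2] [0, 1, 2, 4]
print(csr_spmv(vals, cols, ptrs, [1, 2, 3]))  # [5, 9, 5]
```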


Sparse Matrix-Vector Multiplication for the ClearSpeed Accelerator

Sparse matrix-vector multiplication (SpMV), y = A * x, where A is a sparse matrix and x, y are vectors, is a common computational kernel in many application domains that presents challenges for performance optimization. The high ratio of memory operations to computation and the lack of data reuse cause sparse matrix-vector multiplication to be bandwidth intensive. Additionally, the application ...


Hyper-Systolic Matrix Multiplication

A novel parallel algorithm for matrix multiplication is presented. The hyper-systolic algorithm makes use of a one-dimensional processor abstraction. The procedure can be implemented on all types of parallel systems. It can handle matrix-vector multiplications as well as transposed matrix products.
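The hyper-systolic algorithm builds on a one-dimensional processor abstraction. The sketch below shows only the classic ring-based 1-D systolic matrix-vector product that such abstractions generalize, not the hyper-systolic communication pattern itself: each (simulated) processor holds one column of A and one entry of x, and the partial sums circulate around the ring, collecting one contribution per step.

```python
def systolic_matvec(A, x):
    """Ring-based 1-D systolic y = A @ x for an n x n matrix: PE p owns
    column p of A and x[p]; partial sum y[i] visits PE p at the step
    where i == (p + step) % n, so after n steps every y[i] is complete."""
    n = len(A)
    y = [0] * n
    for step in range(n):
        for p in range(n):          # all PEs fire in parallel each step
            i = (p + step) % n      # which partial sum sits at PE p now
            y[i] += A[i][p] * x[p]
    return y

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```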


Coded Sparse Matrix Multiplication

In a large-scale and distributed matrix multiplication problem C = AB, where C ∈ R^(r×t), coded computation plays an important role in effectively dealing with "stragglers" (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes could destroy the significant sparsity that exists in large-scale machine learning problems, and could resul...


FPGA accelerator for floating-point matrix multiplication

This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering...
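The block matrix multiplication scheme described above, in which each block of C is finished (and can be returned to the host) before the next begins, can be sketched in software as follows. The block size `bs` and the square, evenly divisible matrices are illustrative assumptions, not the paper's parameters, and no FPGA behaviour is modelled.

```python
def matmul_blocked(a, b, bs):
    """C = A @ B computed one bs x bs block of C at a time; assumes
    square n x n matrices with n divisible by bs (illustrative only)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, bs):
        for bj in range(0, n, bs):
            for bk in range(0, n, bs):      # accumulate over block strips
                for i in range(bi, bi + bs):
                    for j in range(bj, bj + bs):
                        c[i][j] += sum(a[i][k] * b[k][j]
                                       for k in range(bk, bk + bs))
            # The (bi, bj) block of C is now final and could be streamed
            # back to the host immediately, avoiding output buffering.
    return c

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
D2 = [[2 if i == j else 0 for j in range(4)] for i in range(4)]  # 2 * I
print(matmul_blocked(A, D2, 2))  # every entry of A doubled
```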



Journal

Journal title: ACM Transactions on Architecture and Code Optimization

Year: 2022

ISSN: 1544-3973, 1544-3566

DOI: https://doi.org/10.1145/3532863